Contributors:
Dominika Gerszewska
Marcin Sidoruk
Andrzej Łososowski
Joanna Zielińska
The Chubb-Chubbs:
This database covers the 200 most-streamed songs on Spotify each day over a period of more than three years.
It includes a wealth of information for each track, gathered via Spotify's API, such as the artist, country, genre, and other relevant details.
To simplify the data, the popularity of each song has been aggregated into a single score.
This Spotify database is a valuable resource for anyone interested in music or data analysis.
The goal of this project is to explore and analyze Spotify's daily top 200 streaming songs data over a period of three years.
The project includes a variety of visualizations and analyses, such as identifying the most popular music genres, creating a map of average popularity, analyzing popularity by language, examining musical diversity, and identifying the most frequently occurring genres by country.
Additionally, the project includes a feature that allows users to input a specific genre and view the top 10 countries where that genre is most popular.
The business objective of this project is to provide valuable insights for artists and music industry professionals who are looking to understand music trends and identify opportunities for market entry.
For example, an artist could use this information to determine which countries to target when promoting their music based on the popularity of their genre in different regions.
Similarly, music industry professionals could leverage this data to make informed decisions about marketing and distribution strategies.
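The genre lookup described earlier (enter a genre, see the countries where it is most popular) can be sketched on a toy table. This is only an illustration under assumptions: the `toy` frame and the helper `top_countries_for_genre` are hypothetical, and "most popular" is taken to mean "appears most often in the charts".

```python
import pandas as pd

# Hypothetical miniature of the chart data: one row per chart appearance.
toy = pd.DataFrame({
    "Country": ["Poland", "Poland", "Turkey", "USA", "USA", "USA", "USA"],
    "Genre":   ["k-pop", "k-pop", "k-pop", "k-pop", "dance pop", "k-pop", "k-pop"],
})

def top_countries_for_genre(frame, genre, n=10):
    """Return the n countries where `genre` appears most often (hypothetical helper)."""
    matches = frame.loc[frame["Genre"] == genre, "Country"]
    # value_counts() sorts in descending order, so head(n) keeps the top n.
    return matches.value_counts().head(n)

top_kpop = top_countries_for_genre(toy, "k-pop")
print(top_kpop)
```

Passing a smaller `n` narrows the result, for example `top_countries_for_genre(toy, "k-pop", n=2)`.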
# Requirements
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
pd.set_option('display.max_columns', 151)
# Loading data #1
#Importing the database with selected columns
df = pd.read_csv('Orginal_database_from_Kaggle/Final database.csv', usecols=['Country', 'Popularity', 'Genre'])
df_1 = pd.read_csv('Orginal_database_from_Kaggle/Final database.csv', usecols=['Country', 'Genre', 'Artist','Title','Album','Cluster','Popularity','Artist_followers'])
# Loading data #2
# Adding an extra data set so that plotly can identify each country
df_country_iso = pd.read_csv('Country_ISO/countries_codes_and_coordinates.csv')  # forward slash avoids the invalid '\c' escape and works on every OS
df_country_iso = df_country_iso.replace('"','', regex=True)
df_country_iso = df_country_iso.replace('United Kingdom', 'UK') # adjusting to data in Spotify dataset
# Loading data #3
# Build a dictionary that maps each country name to its three-letter ISO code
kraj = list(df_country_iso['Country'])  # country names from the ISO table
iso = list(df_country_iso['Alpha-3 code'])  # three-letter country codes from the ISO table
iso = [x.strip(' ') for x in iso]  # strip spaces from the codes
country_to_iso = {}  # renamed from `dict` so the built-in type is not shadowed
for i, j in zip(kraj, iso):  # build the dictionary used to fill the iso_alpha column of df
    country_to_iso.setdefault(i, j)
# Loading data #4
df['iso_alpha'] = df['Country']  # copy Country into iso_alpha so the names can be swapped for three-letter codes
df.replace({"iso_alpha": country_to_iso}, inplace=True)  # swap each name for its three-letter code, which plotly needs to locate the countries
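The name-to-code lookup above can also be written with `Series.map`, which leaves `NaN` wherever a country name has no match and so makes gaps easy to spot. The sketch below is an assumption-laden illustration: `iso_table`, `songs`, `name_to_iso`, and `unmapped` are hypothetical names, and the tiny tables stand in for the real files.

```python
import pandas as pd

# Hypothetical stand-ins for the ISO lookup table and the Spotify data.
iso_table = pd.DataFrame({
    "Country": ["Poland", "UK", "Turkey"],
    "Alpha-3 code": [" POL", " GBR", " TUR"],
})
songs = pd.DataFrame({"Country": ["UK", "Poland", "Global"]})

# Build the name -> code dictionary in one step, stripping stray spaces.
name_to_iso = dict(zip(iso_table["Country"], iso_table["Alpha-3 code"].str.strip()))

# map() yields NaN for names that are missing from the dictionary.
songs["iso_alpha"] = songs["Country"].map(name_to_iso)

# Collect the names that did not receive a code.
unmapped = songs.loc[songs["iso_alpha"].isna(), "Country"].tolist()
print(unmapped)
```

Note that `dict(zip(...))` keeps the last code for a repeated country name, whereas the `setdefault` loop keeps the first; with unique names the two agree.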
df_1.head()
| | Country | Popularity | Title | Artist | Genre | Artist_followers | Album | Cluster |
|---|---|---|---|---|---|---|---|---|
| 0 | Global | 31833.95 | adan y eva | Paulo Londra | argentine hip hop | 11427104.0 | Adan y Eva | global |
| 1 | USA | 8.00 | adan y eva | Paulo Londra | argentine hip hop | 11427104.0 | Adan y Eva | english speaking and nordic |
| 2 | Argentina | 76924.40 | adan y eva | Paulo Londra | argentine hip hop | 11427104.0 | Adan y Eva | spanish speaking |
| 3 | Belgium | 849.60 | adan y eva | Paulo Londra | argentine hip hop | 11427104.0 | Adan y Eva | english speaking and nordic |
| 4 | Switzerland | 20739.10 | adan y eva | Paulo Londra | argentine hip hop | 11427104.0 | Adan y Eva | english speaking and nordic |
df_1.info(show_counts=False)  # `null_counts` was deprecated and later removed; `show_counts` is its replacement
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170633 entries, 0 to 170632
Data columns (total 8 columns):
 #   Column            Dtype
---  ------            -----
 0   Country           object
 1   Popularity        float64
 2   Title             object
 3   Artist            object
 4   Genre             object
 5   Artist_followers  object
 6   Album             object
 7   Cluster           object
dtypes: float64(1), object(7)
memory usage: 10.4+ MB
# Data cleansing
df = df.replace('n-a', np.nan)
df = df.dropna()
df_1 = df_1.replace('n-a', np.nan)
df_1 = df_1.replace('southern europe and portuguese heritage', 'Portuguese heritage')
df_1 = df_1.dropna()
drop_index_cl = df_1[df_1.Cluster == 'global'].index
drop_index_c = df[df.Country == 'Global'].index
df.drop(drop_index_c,inplace=True)
df_1.drop(drop_index_cl,inplace=True)
# Count the distinct values in each descriptive column
Countries = df_1['Country'].nunique()
Genres = df_1['Genre'].nunique()
Titles = df_1['Title'].nunique()
Albums = df_1['Album'].nunique()
Artist = df_1['Artist'].nunique()
df_unique = pd.DataFrame({'Countries': [Countries], 'Genres': [Genres], 'Artist': [Artist], 'Albums': [Albums], 'Title': [Titles]})
df_unique.style.hide(axis='index')  # Styler.hide replaces the deprecated hide_index()
| Countries | Genres | Artist | Albums | Title |
|---|---|---|---|---|
| 34 | 1119 | 23347 | 32633 | 44930 |
# Mean popularity per country (keyed by ISO code), used for the map
by_country = df.groupby('iso_alpha')['Popularity'].mean().reset_index().rename(columns={'iso_alpha': 'Country','Popularity':'Mean Popularity'})
# Number of chart entries and mean popularity per country and language cluster
uniq = df_1.groupby(['Country','Cluster'])['Popularity'].count().reset_index().sort_values(by = 'Country')
country = df_1.groupby(['Country','Cluster'])['Popularity'].mean().reset_index().rename(columns={'Popularity':'Mean_Popularity'}).sort_values(by = 'Country')
uniq['Mean_Popularity'] = country['Mean_Popularity']  # both frames share the same index, so the values align row by row
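The count and the mean per country and cluster can also be computed in a single pass with named aggregation, which avoids relying on index alignment between two separate frames. A minimal sketch on hypothetical data follows; the `toy` frame and the output column name `Count` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample with the same three columns used by the aggregation.
toy = pd.DataFrame({
    "Country": ["Argentina", "Argentina", "Belgium"],
    "Cluster": ["spanish speaking", "spanish speaking", "english speaking and nordic"],
    "Popularity": [100.0, 300.0, 50.0],
})

# One groupby produces both statistics side by side.
summary = (
    toy.groupby(["Country", "Cluster"])["Popularity"]
       .agg(Count="count", Mean_Popularity="mean")
       .reset_index()
       .sort_values(by="Country")
)
print(summary)
```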
# Choropleth map of mean popularity by country
country_list = by_country
fig = px.choropleth(country_list, locations='Country',
                    color='Mean Popularity',  # column that drives the color scale
                    hover_name='Country',  # column to add to hover information
                    color_continuous_scale=px.colors.sequential.Rainbow,
                    width=600,
                    height=600,
                    projection='mercator')
fig.update_layout(title='Map of countries')
fig.show()
fig = px.sunburst(country,
path=['Cluster','Country'],
values='Mean_Popularity',
color='Mean_Popularity',
color_continuous_scale=px.colors.sequential.Rainbow,
width = 600,
height = 800,
title= 'Distribution by language'
)
fig.show()
country_en = country[country.Cluster == 'english speaking and nordic'].sort_values(ascending=False, by = 'Mean_Popularity')
country_spanish = country[country.Cluster == 'spanish speaking'].sort_values(ascending=False, by = 'Mean_Popularity')
country_portuguese = country[country.Cluster == 'Portuguese heritage'].sort_values(ascending=False, by = 'Mean_Popularity')
fig = make_subplots(rows=1, cols=3, subplot_titles=( "Spanish speaking", "Portuguese heritage", "English speaking and nordic",), shared_yaxes=True, horizontal_spacing=0.1)
fig.add_trace(go.Bar(x=country_en.Country, y=country_en.Mean_Popularity), row=1, col=3)
fig.add_trace(go.Bar(x=country_spanish.Country, y=country_spanish.Mean_Popularity), row=1, col=1)
fig.add_trace(go.Bar(x=country_portuguese.Country, y=country_portuguese.Mean_Popularity), row=1, col=2)
fig.update_layout(height=400, width=1000,
title_text="Mean popularity in each country by language cluster", showlegend=False, yaxis_title='Mean popularity', xaxis_title='Country')
fig.update_traces(width=0.4)
fig.show()
uniq = uniq.sort_values(ascending=False, by = 'Popularity')
fig = px.bar(uniq,y=uniq.Country,
x=uniq.Popularity,
labels={'Country':'Country', 'Popularity':'The number of occurrences'},
color = 'Mean_Popularity',
color_continuous_scale = px.colors.sequential.Rainbow,
orientation='h')
fig.update_layout(title='Number of songs that appeared in the top 200 in each country', height=800)
fig.update_traces(width=0.4)
fig.show()
count_genre2 = df.groupby('Country')['Genre'].nunique().sort_values(ascending= False)
fig = px.bar(y=count_genre2.index, x=count_genre2.values, labels={'x':'The number of different genres', 'y':'Country'}, orientation='h')
fig.update_layout(title='Number of distinct genres per country', height=800)
fig.update_traces(width=0.4)
fig.show()
genre_counts = df['Genre'].value_counts().nlargest(10)
fig = px.bar(x=genre_counts.index, y=genre_counts, labels={'x':'Genre', 'y':'The number of songs'})
fig.update_layout(title='Top 10 most popular music genres')
fig.update_traces(width=0.4)
fig.show()
result = df.groupby('Country')['Genre'].apply(lambda x: x.value_counts().nlargest(1)).sort_values(ascending=True).reset_index(name='Counts')
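result.rename(columns = {'level_1' : 'Legend', 'Genre' : 'Legend'}, inplace=True)  # the genre level is named 'level_1' or 'Genre' depending on the pandas version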
wykres2 = px.bar(result, y='Country', x='Counts', color='Legend', orientation='h')
wykres2.update_traces(textposition='inside',width =0.4)
wykres2.update_layout(xaxis_title='Number of occurrences',
yaxis_title='Country',
height=800)
wykres2.update_layout(title='The most common genre in each country')
wykres2.show()
poland_counts = df.query('Country == "Poland"')['Genre'].value_counts().nlargest(10)
turkey_counts = df.query('Country == "Turkey"')['Genre'].value_counts().nlargest(10)
ecuador_counts = df.query('Country == "Ecuador"')['Genre'].value_counts().nlargest(10)
fig = make_subplots(rows=1, cols=3, subplot_titles=("Poland", "Turkey", "Ecuador"), shared_yaxes=True)
fig.add_trace(go.Bar(x=poland_counts.index, y=poland_counts), row=1, col=1)
fig.add_trace(go.Bar(x=turkey_counts.index, y=turkey_counts), row=1, col=2)
fig.add_trace(go.Bar(x=ecuador_counts.index, y=ecuador_counts), row=1, col=3)
fig.update_layout(height=400, width=1000, title_text="Top 10 most popular music genres in selected countries", showlegend=False, yaxis_title='Number of occurrences', xaxis_title='Genre')
fig.show()
# Display the countries where the selected genre appears most often
wprowadzony_gatunek = input("Please enter the name of the music genre for which you want to see ordered countries by count: ")
nowy_df = df.loc[df['Genre'] == wprowadzony_gatunek, ['Genre', 'Country','iso_alpha']]
# Count occurrences per country; naming the column in reset_index works across pandas versions
top_counts = nowy_df[['iso_alpha','Country']].value_counts().reset_index(name='Counts')
# Plotting a map of the countries for the selected genre
country_list = top_counts
fig = px.choropleth(country_list, locations="iso_alpha",
                    color="Counts",  # color each country by its number of occurrences
                    hover_name="Country",  # column to add to hover information
                    color_continuous_scale=px.colors.sequential.Rainbow,
                    width=600,
                    height=600,
                    projection='mercator')
fig.show()
# Plotting a bar chart of the countries for the selected genre
fig = px.bar(top_counts, y='Country', x='Counts', labels={'Counts':'Number of occurrences', 'Country':'Country'}, orientation='h')  # plot from top_counts directly so the data frame and the columns match
fig.update_layout(title=f"Number of occurrences per country for the selected genre ({wprowadzony_gatunek})", height=800)
fig.show()
Please enter the name of the music genre for which you want to see ordered countries by count: k-pop